We have the following C++ function that needs to be sped up:
```cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

//' Fourth order biweight kernel
// [[Rcpp::export]]
arma::mat K4B_v1(
    arma::mat X,
    arma::mat Y,
    arma::vec h
) {

  arma::uword n_n = X.n_rows;
  arma::uword n_m = Y.n_cols;

  // Zero-initializing explicitly (older Armadillo versions
  // do not zero the memory by default)
  arma::mat Nhat(n_n, n_m, arma::fill::zeros);
  arma::vec Dhat(n_n, arma::fill::zeros);
  arma::mat Yhat(n_n, n_m, arma::fill::zeros);

  for (arma::uword i = 0; i < n_n; ++i)
  {
    const auto xrow_i = X.row(i);
    for (arma::uword j = 0; j < i; ++j)
    {
      arma::vec Dji_h = (X.row(j) - xrow_i) / h;

      auto Dji_h2 = arma::pow(Dji_h, 2.0);
      double Kji_h = prod(
        (arma::abs(Dji_h) < 1) %
          (1.0 - 3.0 * Dji_h2) %
          arma::pow(1.0 - Dji_h2, 2.0) * 105.0 / 64.0
      );

      Dhat(i) += Kji_h;
      Dhat(j) += Kji_h;

      Nhat.row(i) += Y.row(j) * Kji_h;
      Nhat.row(j) += Y.row(i) * Kji_h;
    }
  }

  for (arma::uword i = 0u; i < n_n; ++i)
  {
    if (Dhat(i) != 0)
      Yhat.row(i) = Nhat.row(i) / Dhat(i);
  }

  return Yhat;
}
```

We will use OpenMP to accelerate the function.
RcppArmadillo + OpenMP
======================
- Friendlier than RcppParallel... at least for "I-use-Rcpp-but-don't-actually-know-much-about-C++" users (like myself!).
- Must run only thread-safe calls, so calling R within parallel blocks can cause problems (almost all the time).
- Use arma objects, e.g., arma::mat, arma::vec, etc., or, if you are used to them, std::vector objects, as these are thread-safe.
- Pseudo-random number generation is not very straightforward... but C++11 has a nice set of functions that can be used together with OpenMP.
- Need to think about how processors work, cache memory, etc. Otherwise, you could get into trouble... if your code is slower when run in parallel, you are probably facing false sharing.
- If R crashes, try running R with a debugger (see Section 4.3 in Writing R Extensions):

```shell
~$ R --debugger=valgrind
```

Tell Rcpp that you need to include OpenMP in the compiler:
```cpp
#include <omp.h>
// [[Rcpp::plugins(openmp)]]
```

Within your function, set the number of cores, e.g.,
```cpp
// Setting the cores
omp_set_num_threads(cores);
```

Tell the compiler that you'll be running a block in parallel with OpenMP:
```cpp
#pragma omp [directives] [options]
{
  ...your neat parallel code...
}
```

You'll need to specify how OMP should handle the data:
- shared: Default; all threads access the same copy.
- private: Each thread has its own copy, uninitialized.
- firstprivate: Each thread has its own copy, initialized.
- lastprivate: Each thread has its own copy; the value from the last iteration is kept.

Setting default(none) is a good practice.
Compile!
Our own version of the dist function… but in parallel!
```cpp
#include <omp.h>
#include <RcppArmadillo.h>

using namespace Rcpp;

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]

// [[Rcpp::export]]
arma::mat dist_par(const arma::mat & X, int cores = 1) {

  // Some constants
  int N = (int) X.n_rows;
  int K = (int) X.n_cols;

  // Output
  arma::mat D(N, N);
  D.zeros(); // Filling with zeros

  // Setting the cores
  omp_set_num_threads(cores);

  #pragma omp parallel for shared(D, N, K, X) default(none)
  for (int i = 0; i < N; ++i)
  {
    for (int j = 0; j < i; ++j)
    {
      // Computing the Euclidean distance
      D.at(i, j) = std::sqrt(
        arma::sum(arma::pow(X.row(i) - X.row(j), 2.0))
      );

      // Filling the upper triangle by symmetry
      D.at(j, i) = D.at(i, j);
    }
  }

  // My nice distance matrix
  return D;
}
```

The key lines:
- `#include <omp.h>`: includes the OpenMP library in our program.
- `// [[Rcpp::plugins(openmp)]]`: notifies Rcpp::sourceCpp that we are using OpenMP.
- `omp_set_num_threads(cores)`: sets the number of threads.
- `#pragma omp parallel for ...`: tells the compiler to run the loop that follows in parallel.
Let's see how much faster our function is:
```r
# Sourcing
Rcpp::sourceCpp("dist_par.cpp")

# Simulating data
set.seed(1231)
K <- 50
n <- 10000
x <- matrix(rnorm(n*K), ncol=K)

# Are we getting the same?
hist(as.matrix(dist(x)) - dist_par(x, 4)) # Only zeros

# Benchmarking!
bm <- microbenchmark::microbenchmark(
  dist(x),                # stats::dist
  dist_par(x, cores = 1), # 1 core
  dist_par(x, cores = 4), # 4 cores
  dist_par(x, cores = 8), # 8 cores
  times = 1
)
```

```
Warning in microbenchmark::microbenchmark(dist(x), dist_par(x, cores = 1), :
less accurate nanosecond times to avoid potential integer overflows
```
```r
print(bm, unit = "s")
```

```
Unit: seconds
                   expr       min        lq      mean    median        uq       max neval
                dist(x) 1.9862943 1.9862943 1.9862943 1.9862943 1.9862943 1.9862943     1
 dist_par(x, cores = 1) 1.0103363 1.0103363 1.0103363 1.0103363 1.0103363 1.0103363     1
 dist_par(x, cores = 4) 0.8367847 0.8367847 0.8367847 0.8367847 0.8367847 0.8367847     1
 dist_par(x, cores = 8) 0.3581303 0.3581303 0.3581303 0.3581303 0.3581303 0.3581303     1
```
```r
print(bm, unit = "relative")
```

```
Unit: relative
                   expr      min       lq     mean   median       uq      max neval
                dist(x) 5.546290 5.546290 5.546290 5.546290 5.546290 5.546290     1
 dist_par(x, cores = 1) 2.821142 2.821142 2.821142 2.821142 2.821142 2.821142     1
 dist_par(x, cores = 4) 2.336537 2.336537 2.336537 2.336537 2.336537 2.336537     1
 dist_par(x, cores = 8) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000     1
```
There’s more to OpenMP. I invite you to go over a set of experiments where I compare different strategies for implementing the same function here: https://github.com/UofUEpiBio/r-parallel-benchmark
Before continuing to the next section, try to apply what we just learned by adding OpenMP to the function K4B_v1. To do so, write a C++ function named K4B_v2 and try it out using the following code:
```r
n <- 500
Y <- matrix(rnorm(n * n), ncol = n)
X <- cbind(rnorm(n))

# Running the benchmark
res <- microbenchmark::microbenchmark(
  K4B_v1(X, Y, 2),
  K4B_v2(X, Y, 2, 4),
  check = "equal",
  times = 10
)

print(res, unit = "relative")
print(res, unit = "ms")
# Unit: relative
#                expr     min       lq     mean   median       uq      max neval cld
#     K4B_v1(X, Y, 2) 3.32887 3.356954 2.268484 2.412626 1.781527 1.462261    10   b
#  K4B_v2(X, Y, 2, 4) 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a
# Unit: milliseconds
#                expr       min        lq     mean    median       uq      max neval cld
#     K4B_v1(X, Y, 2) 227.04614 229.69930 231.3196 231.48199 233.8489 234.0478    10   b
#  K4B_v2(X, Y, 2, 4)  68.20517  68.42492 101.9710  95.94608 131.2631 160.0588    10  a
```

We need to execute the following R script on CHPC:
```r
library(Rcpp)

# Compiling
Rcpp::sourceCpp("distance.cpp", verbose = TRUE)

# If not in a job array, the default value of id is "0"
id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "0"))
message("We are working in array #", id)

is_array <- Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "") != ""

# Generating seeds
set.seed(123)

# Need to make sure all arrays have different seeds
if (is_array) {
  seeds <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_COUNT", 1))
  seeds <- sample.int(.Machine$integer.max, seeds)
  set.seed(seeds[id])
}

n <- 500
Y <- matrix(rnorm(n * n), ncol = n)
X <- cbind(rnorm(n))

# Running the benchmark
res <- microbenchmark::microbenchmark(
  K4B_v1(X, Y, 2),
  K4B_v2(X, Y, 2, 4),
  check = "equal", times = 10
)

print(res, unit = "relative")
print(res, unit = "ms")

png(
  ifelse(
    is_array,
    sprintf("openmp-example-%02i.png", id),
    "openmp-example.png"
  )
)
boxplot(res, unit = "ms")
dev.off()
```

There are two ways of doing it: single or multiple jobs (using job arrays). To submit the job, I have created two bash (Slurm) scripts we can submit with the sbatch command. Most of the time, we will be submitting single jobs (time-consuming operations, large memory requirements, etc.). Job arrays are useful when the entire script can be parallelized. Submitting jobs using sbatch is straightforward:
```shell
sbatch [your-slurm-script]
```

Where your-slurm-script should be replaced with the name (and path, if needed) of the Slurm script. To check the status of your submitted jobs, you can use the squeue command, e.g.,
```shell
squeue -u[your user id]
```

Where your user id should be replaced with... your user ID. This is equivalent to typing the following command:
```shell
squeue -u$(id -u)
```

Once your job is submitted, the R script should generate a boxplot like the following:

The distance-job.slurm contains the following lines:
```shell
#!/bin/sh
#SBATCH --job-name=rcpp-openmp-array
#SBATCH --account=vegayon-np
#SBATCH --partition=vegayon-shared-np
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00
#SBATCH --output=slurmlog.log

# Loading R
module load R/4.2.2

# Running the script
Rscript --vanilla distance.R
```

Job arrays are a way to submit the same job multiple times. The benefits of using job arrays are manifold. One of the key advantages of this type of submission is that a single script can be executed across many nodes, making job arrays a powerful tool.
When a job array starts, Slurm generates a set of environment variables available to the user. One of those is the SLURM_ARRAY_TASK_ID environment variable. With it, we can identify which of the n tasks the current job is running. In our R script, we do so using the Sys.getenv() function (whose result we assign to the object id):
```r
# If not in a job array, the default value of id is "0"
id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "0"))
```

That way, our R script can behave differently depending on the array id. If your R script depends on generating random numbers (e.g., a simulation study), it is paramount to deal with RNG seeds properly. A common way to do so is to set a seed for a baseline job and then sample individual seeds for each one of the tasks. We do that as follows:
```r
# Generating seeds
set.seed(123)

# Need to make sure all arrays have different seeds
if (is_array) {
  seeds <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_COUNT", 1))
  seeds <- sample.int(.Machine$integer.max, seeds)
  set.seed(seeds[id])
}
```

The distance-job-array.slurm file contains the following lines:
```shell
#!/bin/sh
#SBATCH --job-name=rcpp-openmp-array
#SBATCH --account=vegayon-np
#SBATCH --partition=vegayon-shared-np
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00
#SBATCH --output=slurmlog%a.log
#SBATCH --array=1-2

# Loading R
module load R/4.2.2

# Running the script
Rscript --vanilla distance.R
```